
7 research outputs found

Multilingual unsupervised word alignment models and their application

Author: Mansouri Bigvand Anahita
Publication date: 05/03/2021

Word alignment is an essential task in natural language processing because of its critical role in training statistical machine translation (SMT) models, error analysis for neural machine translation (NMT), building bilingual lexicons, and annotation transfer. In this thesis, we explore models for word alignment, how they can be extended to incorporate linguistically-motivated alignment types, and how they can be neuralized in an end-to-end fashion. In addition to these methodological developments, we apply our word alignment models to cross-lingual part-of-speech projection. First, we present a new probabilistic model for word alignment where word alignments are associated with linguistically-motivated alignment types. We propose a novel task of joint prediction of word alignment and alignment types and propose novel semi-supervised learning algorithms for this task. We also solve a sub-task of predicting the alignment type given an aligned word pair. The proposed joint generative models (alignment-type-enhanced models) significantly outperform the models without alignment types in terms of word alignment and translation quality. Next, we present an unsupervised neural Hidden Markov Model for word alignment, where emission and transition probabilities are modeled using neural networks. The model is simpler in structure, allows for seamless integration of additional context, and can be used in an end-to-end neural network. Finally, we tackle the part-of-speech tagging task for the zero-resource scenario where no part-of-speech (POS) annotated training data is available. We present a cross-lingual projection approach where neural HMM aligners are used to obtain high-quality word alignments between resource-poor and resource-rich languages. Moreover, high-quality neural POS taggers are used to provide annotations for the resource-rich language side of the parallel data, as well as to train a tagger on the projected data. Our experimental results on truly low-resource languages show that our methods outperform their corresponding baselines.
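The cross-lingual projection step described in this abstract can be illustrated with a minimal sketch. It assumes that word alignments are already available as (source index, target index) pairs and that each target token takes the most frequent tag among the source tokens aligned to it. The function name `project_pos_tags`, the fallback tag `"X"`, and the voting rule are illustrative assumptions, not the thesis's actual procedure.

```python
from collections import Counter


def project_pos_tags(source_tags, alignment, target_length, default_tag="X"):
    """Project POS tags from source tokens onto target tokens.

    source_tags   : list of tags, one per source token
    alignment     : list of (source_index, target_index) pairs
    target_length : number of target tokens
    default_tag   : hypothetical fallback for unaligned target tokens
    """
    # Collect, for each target position, the tags of all aligned source tokens.
    votes = [Counter() for _ in range(target_length)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1

    # Each target token receives its most frequent projected tag, if any.
    projected = []
    for counter in votes:
        if counter:
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(default_tag)
    return projected


# Toy example: three tagged source tokens, four target tokens, one unaligned.
tags = project_pos_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], 4)
print(tags)
```

The projected tags would then serve as noisy training data for a tagger on the resource-poor side, which is the role the abstract assigns to the projected data.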

Simon Fraser University Institutional Repository

A FAST ALGORITHM FOR COMPUTING HIGHLY SENSITIVE MULTIPLE SPACED SEEDS

Author: Bigvand Anahita Mansouri
Publication venue: Scholarship@Western
Publication date: 01/01/2011

The main goal of homology search is to find similar segments, or local alignments, between two DNA or protein sequences. Since the dynamic programming algorithm of Smith-Waterman is too slow, heuristic methods have been designed to achieve both efficiency and accuracy. Seed-based methods were made well known by their use in BLAST, the most widely used software program in biological applications. The seed of BLAST trades sensitivity for speed, and spaced seeds were introduced in PatternHunter to achieve both. Several seeds are better than one, and near-perfect sensitivity can be obtained while maintaining the speed. Therefore, multiple spaced seeds quickly became the state of the art in similarity search, being employed by many software programs. However, the quality of these seeds is crucial, and computing optimal multiple spaced seeds is NP-hard. All but one of the existing heuristic algorithms for computing good seeds are exponential. Our work has two main goals. First, we engineer the only existing polynomial-time heuristic algorithm to compute better seeds than any other program, while running orders of magnitude faster. Second, we estimate its performance by comparing its seeds with the optimal seeds in a few practical cases. In order to make the computation feasible, a very fast implementation of the sensitivity function is provided.
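To make the notion of a spaced seed concrete, here is a minimal sketch of seed-based hit detection. It assumes the common notation in which a seed is a string over `1` (positions that must match) and `*` (positions that are ignored); a contiguous, BLAST-like seed is then a run of `1`s. The function name `seed_hits` and the toy sequences are illustrative and do not come from the thesis.

```python
def seed_hits(seed, query, target):
    """Return all (i, j) such that the seed hits query at i and target at j.

    A hit means the two sequences agree at every '1' position of the seed;
    '*' positions are not compared.
    """
    care = [k for k, c in enumerate(seed) if c == "1"]
    span = len(seed)

    # Index the target by the letters found at the seed's '1' positions.
    index = {}
    for j in range(len(target) - span + 1):
        key = "".join(target[j + k] for k in care)
        index.setdefault(key, []).append(j)

    # Look up each query window in the index.
    hits = []
    for i in range(len(query) - span + 1):
        key = "".join(query[i + k] for k in care)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits


# The spaced seed tolerates the mismatch at the third position;
# the contiguous seed of the same length does not.
print(seed_hits("11*1", "ACGT", "ACTT"))
print(seed_hits("1111", "ACGT", "ACTT"))
```

In a seed-based search tool, each hit would typically be extended into a local alignment; that extension step is outside this sketch.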

Scholarship@Western

An Easily Extensible HMM Word Aligner

Author: Gū Jetic
Mansouri-Bigvand Anahita
Sarkar Anoop
Publication date: 24/10/2018

In this paper, we present a new word aligner with built-in support for alignment types, as well as comparisons between various models and existing aligner systems. It is open-source software that can be easily extended to use models of users' own design. We expect it to serve academics as well as scientists working in industry, both for doing word alignment and for experimenting with their own new models. In the present paper, the basic designs and structures are introduced. Examples and demos of the system are also provided.

Simon Fraser University Institutional Repository

Seeds for effective oligonucleotide design

Author: A Califano
Anahita Mansouri Bigvand
B Ma
F Li
H Nielsen
J Rouillard
L Ilie
L Kaderali
L Noe
Lucian Ilie
M David
M Girdea
M Li
N Reymond
S Altschul
S Feng
S Rahman
S Rimour
Shima Khoshraftar
Silvana Ilie
T Smith
WH Chung
Y Chen
Z Bozdech
Z He
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: DNA oligonucleotides are a very useful tool in biology. The best algorithms for designing good DNA oligonucleotides filter out unsuitable regions using a seeding approach. Determining the quality of the seeds is crucial for the performance of these algorithms. Results: We present a sound framework for evaluating the quality of seeds for oligonucleotide design. The F-score is used to measure the accuracy of each seed. A number of natural candidates are tested: contiguous (BLAST-like), spaced, transitions-constrained, and multiple spaced seeds. Multiple spaced seeds are the best, with more seeds providing better accuracy. Single spaced and transition seeds are very close whereas, as expected, contiguous seeds come last. Increased accuracy comes at the price of reduced efficiency. An exception is that single spaced and transitions-constrained seeds are both more accurate and more efficient than contiguous ones. Conclusions: Our work confirms another application where multiple spaced seeds perform the best. It will be useful in improving the algorithms for oligonucleotide design.
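For reference, the F-score mentioned above is the harmonic mean of precision and recall. The sketch below computes it from raw counts; how the paper defines true positives, false positives, and false negatives for a seed is not stated in this abstract, so the count arguments and the example numbers are placeholders rather than results.

```python
def f_score(true_positives, false_positives, false_negatives):
    """F-score (F1): harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, which also avoids
    division by zero.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Placeholder counts, not data from the paper.
print(f_score(8, 2, 2))
```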

CiteSeerX

Scholarship@Western

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

SpEED: fast computation of sensitive spaced seeds

Author: Anahita Mansouri Bigvand
Lucian Ilie
Silvana Ilie
Publication date: 01/01/2011

Multiple spaced seeds represent the current state of the art for similarity search in bioinformatics, with applications in various areas such as sequence alignment, read mapping, oligonucleotide design, etc. We present SpEED, a software program that computes highly sensitive multiple spaced seeds. SpEED can be several orders of magnitude faster and computes better seeds than the existing leading software programs. Availability: The source code of SpEED is freely available at www.csd.uwo.ca/~ilie/SpEED/
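As background for the term "sensitive", the sketch below computes the sensitivity of a set of seeds under the usual Bernoulli model: the probability that at least one seed hits a random region of a given length in which each position matches independently with a fixed probability. It uses brute-force enumeration, which is exponential in the region length and only practical for toy sizes; it is not the method used by SpEED, and the function name and parameters are illustrative assumptions.

```python
from itertools import product


def seed_sensitivity(seeds, region_length, match_prob):
    """Probability that at least one seed hits a random region.

    seeds         : list of strings over '1' (must match) and '*' (ignored)
    region_length : number of aligned positions in the region
    match_prob    : independent probability that a position is a match
    """
    care_sets = [[k for k, c in enumerate(s) if c == "1"] for s in seeds]
    total = 0.0
    # Enumerate every match (1) / mismatch (0) pattern of the region.
    for region in product((0, 1), repeat=region_length):
        hit = any(
            all(region[start + k] for k in care)
            for seed, care in zip(seeds, care_sets)
            for start in range(region_length - len(seed) + 1)
        )
        if hit:
            ones = sum(region)
            total += match_prob ** ones * (1 - match_prob) ** (region_length - ones)
    return total


# Toy comparison on a region of length 4 with match probability 0.5.
print(seed_sensitivity(["111"], 4, 0.5))
print(seed_sensitivity(["111", "1*1"], 4, 0.5))
```

Adding a seed to the set can only keep or raise this probability, which matches the abstracts' observation that several seeds are better than one.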

CiteSeerX

SpEED: fast computation of sensitive spaced seeds

Author: Altschul
Anahita Mansouri Bigvand
Buhler
Califano
David
Feng
Homer
Ilie
Kucherov
Li
Lucian Ilie
Ma
Noé
Silvana Ilie
Publication venue: Oxford University Press (OUP)

Crossref